Notes: Machine Learning by Andrew Ng (Week9)

Density Estimation

Problem Motivation

Anomaly detection (also outlier detection) is the identification of rare items, events or observations which raise suspicions by differing significantly from the majority of the data.

Take a quality-assurance test on aircraft engines as an example. During the test, the heat generated and the vibration intensity are measured as features of each engine. When a new engine comes off the production line the next day, we want to know whether it should undergo further testing.

  • $p(x_{test}) < \epsilon \rightarrow \text{anomaly}$
  • $p(x_{test}) \ge \epsilon \rightarrow \text{OK}$

image1

Anomaly detection example:

  • Fraud detection:
          $x^{(i)}$ = features of user $i$'s activities.
          Model $p(x)$ from data.
          Identify unusual users by checking which have $p(x) < \epsilon$.

  • Manufacturing

  • Monitoring computers in a data center:
          $x^{(i)}$ = features of machine $i$.
          $x_1$ = memory use, $x_2$ = number of disk accesses/sec,
          $x_3$ = CPU load, $x_4$ = CPU load/network traffic…

Gaussian Distribution (Normal Distribution)

If $x \in \mathbb R$ is distributed Gaussian with mean $\mu$ and variance $\sigma^2$, we write $x \sim \mathcal N(\mu, \sigma^2)$:
$$p(x; \mu, \sigma^2) = \frac 1{\sqrt {2\pi} \sigma} e^{-\frac {(x - \mu)^2}{2 \sigma^2} }$$
$$\mu = \frac 1m \sum_{i=1}^m x^{(i)}, \quad \sigma^2 = \frac 1m \sum_{i=1}^m (x^{(i)} - \mu)^2$$

Anomaly Detection Algorithm

We are going to model $p(x)$ from a training set {$x^{(1)}$, …, $x^{(m)}$}, where each example $x \in \mathbb R^n$:
$$p(x) = p(x_1; \mu_1, \sigma_1^2) \times p(x_2; \mu_2, \sigma_2^2) \times p(x_3; \mu_3, \sigma_3^2) \times \cdots \times p(x_n; \mu_n, \sigma_n^2),$$
where $x_1 \sim \mathcal N(\mu_1, \sigma_1^2)$, $\ x_2 \sim \mathcal N(\mu_2, \sigma_2^2)$, $\ x_3 \sim \mathcal N(\mu_3, \sigma_3^2)$, … (estimating the distribution $p(x)$ is sometimes called the problem of density estimation). Note that this model treats the features as independent.

Hence, the algorithm is:

  • Choose features $x_i$ that might be indicative of anomalous examples.
  • Fit parameters $\mu_1, …, \mu_n, \sigma_1^2, …, \sigma_n^2$
    $$\mu_j = \frac 1m \sum_{i=1}^m x_j^{(i)}, \quad \sigma_j^2 = \frac 1m \sum_{i=1}^m (x_j^{(i)} - \mu_j)^2$$
  • Given new example $x$, compute $p(x)$:
    $$p(x) = \prod_{j=1}^n p(x_j; \mu_j, \sigma_j^2) = \prod_{j=1}^n \frac 1{\sqrt {2 \pi} \sigma_j} \exp \left(- \frac {(x_j - \mu_j)^2}{2 \sigma_j^2} \right)$$ Flag an anomaly if $p(x) < \epsilon$.
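The algorithm above can be sketched in a few lines of NumPy. This is my own minimal sketch, not code from the course; the function names and the toy data are illustrative only.

```python
import numpy as np

def fit_gaussian(X):
    """Estimate per-feature mean and variance from an (m, n) training set."""
    mu = X.mean(axis=0)
    sigma2 = X.var(axis=0)  # (1/m) * sum((x - mu)^2), matching the notes
    return mu, sigma2

def p(X, mu, sigma2):
    """Product of independent univariate Gaussian densities for each row of X."""
    coef = 1.0 / np.sqrt(2 * np.pi * sigma2)
    exps = np.exp(-((X - mu) ** 2) / (2 * sigma2))
    return np.prod(coef * exps, axis=1)

# Toy usage: flag anomalies with a chosen threshold epsilon
X_train = np.array([[1.0, 2.0], [1.1, 1.9], [0.9, 2.1], [1.0, 2.0]])
mu, sigma2 = fit_gaussian(X_train)
probs = p(np.array([[1.0, 2.0], [5.0, -3.0]]), mu, sigma2)
epsilon = 1e-3
is_anomaly = probs < epsilon  # the far-off second example is flagged
```

The second test point lies far from the training data in both features, so its density falls below $\epsilon$ and it is flagged.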

Building an Anomaly Detection System

Developing and Evaluating an Anomaly Detection System

The importance of real-number evaluation: when developing a learning algorithm (choosing features, etc.), making decisions is much easier if we have a way of evaluating the algorithm with a single number.

For instance, should an extra feature be included or not? Run the algorithm with and without the feature, and get back a number indicating whether adding it improves or worsens performance; that is a much better basis for deciding whether the feature should be included.

Therefore, in order to develop an anomaly detection system quickly, it is really helpful to have a way of evaluating it.

Assume we have some labeled data of anomalous and non-anomalous examples ($y = 0$ if normal, $y = 1$ if anomalous). Then the process of developing and evaluating an anomaly detection algorithm is as follows:

  • Training set: $x^{(1)}, x^{(2)}, …, x^{(m)}$ (assume normal examples / not anomalous, and it will still work if a few anomalies slip into the unlabeled training set)
  • Cross validation set: $(x_{cv}^{(1)}, y_{cv}^{(1)}), …, (x_{cv}^{(m_{cv})}, y_{cv}^{(m_{cv})})$
  • Test set: $(x_{test}^{(1)}, y_{test}^{(1)}), …, (x_{test}^{(m_{test})}, y_{test}^{(m_{test})})$

For instance, here is the aircraft engines motivating example:

  • 10000 good engines (normal)
  • 20 flawed engines (anomalous; anywhere from 2 to 50 is typical):

Training set: 6000 good engines
CV: 2000 good engines (y = 0), 10 anomalous (y = 1)
Test: 2000 good engines (y = 0), 10 anomalous (y = 1)

Algorithm evaluation:
Fit model $p(x)$ on training set {$x^{(1)}, …, x^{(m)}$}
On a cross validation / test example $x$, predict
$$ y =
\begin{cases}
1 &\ \text{if} \ p(x) < \epsilon \ \text{(anomaly)} \cr
0 &\ \text{if} \ p(x) \ge \epsilon \ \text{(normal)}
\end{cases}
$$
Possible evaluation metrics:

  • True positive, false positive, false negative, true negative
  • Precision / Recall
  • $F_1$ score

Can also use cross validation set to choose parameter $\epsilon$
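Choosing $\epsilon$ on the cross validation set can be done by scanning candidate thresholds and keeping the one with the best $F_1$ score. A minimal sketch (my own code, not the course's; `select_epsilon` is an assumed name):

```python
import numpy as np

def select_epsilon(p_cv, y_cv):
    """Scan candidate thresholds; keep the one with the best F1 on the CV set."""
    best_eps, best_f1 = 0.0, 0.0
    for eps in np.linspace(p_cv.min(), p_cv.max(), 1000):
        pred = (p_cv < eps).astype(int)          # 1 = flagged anomalous
        tp = np.sum((pred == 1) & (y_cv == 1))   # true positives
        fp = np.sum((pred == 1) & (y_cv == 0))   # false positives
        fn = np.sum((pred == 0) & (y_cv == 1))   # false negatives
        if tp == 0:
            continue                             # precision/recall undefined
        prec = tp / (tp + fp)
        rec = tp / (tp + fn)
        f1 = 2 * prec * rec / (prec + rec)
        if f1 > best_f1:
            best_eps, best_f1 = eps, f1
    return best_eps, best_f1
```

$F_1$ is used rather than accuracy because the classes are very skewed: predicting $y = 0$ everywhere would already be highly "accurate".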

Anomaly Detection vs. Supervised Learning

| Anomaly detection | Supervised learning |
| --- | --- |
| Very small number of positive ($y = 1$) examples (0–20 is common) and a large number of negative ($y = 0$) examples, which can fit $p(x)$ very well | Large number of both positive and negative examples |
| Many different "types" of anomalies, so it is hard for any algorithm to learn what anomalies look like from the positive examples; future anomalies may look nothing like any anomalous example seen so far | Enough positive examples for the algorithm to get a sense of what positive examples are like; future positive examples are likely to be similar to the ones in the training set |

A key difference is that, in anomaly detection, there are so few positive examples that a learning algorithm cannot learn from them; instead, the large set of negative examples is used to learn $p(x)$.

Choosing What Features to Use

In the anomaly detection algorithm, one of the most important steps is modeling the features with Gaussian distributions. Hence, it is recommended to plot the data, or a histogram of it, to make sure the data looks at least vaguely Gaussian before feeding it to the anomaly detection algorithm.

After plotting the histogram, if the data looks non-Gaussian, try different transformations of the data (e.g. $\log x$, $\sqrt x$) to make it look more Gaussian.
image2

Error analysis for anomaly detection:
Want $p(x)$ large for normal examples $x$ and $p(x)$ small for anomalous examples $x$.

Most common problem:
$p(x)$ is comparable (say, both large) for normal and anomalous examples.
image3

In the picture above, an anomalous example is drawn in green. It gets a fairly high probability, so the algorithm fails to flag it as anomalous. What we should do is look at that example, find what is unusual about it, and use that insight to come up with a new feature $x_2$ that helps distinguish the bad example.

Thus, after creating the new feature $x_2$, re-plot the data (the scatter plot on the right). For the anomalous example, the feature $x_2$ now takes on an unusual value.

In short, choose features that take on either very large or very small values for examples that are likely to be anomalies.

Also, we can create new features by combining the existing features, such as $x_3 = \frac {x_1}{x_2}$, $x_4 = \frac {(x_1)^2}{x_2}$. By creating features like these, we can start to capture anomalies that correspond to unusual combinations of values of the features.

Multivariate Gaussian Distribution

Multivariate Gaussian Distribution

Suppose there is a new example, marked with the green cross ($x_1 = 0.4$, $x_2 = 1.5$), which has a very low CPU load and a high memory use. That combination looks like it should be an anomaly. Nonetheless, each feature $x_1$ and $x_2$ individually has a reasonably high probability, so the anomaly detection algorithm fails to flag the green cross as an anomaly.
image4

The multivariate Gaussian distribution can fix this problem.

Multivariate Gaussian (Normal) Distribution
$x \in \mathbb R^n$. This time, don’t model $p(x_1), p(x_2), …,$ etc. separately.
Model $p(x)$ all in one go.
Parameters: $\mu \in \mathbb R^n, \Sigma \in \mathbb R^{n \times n}$ (covariance matrix), $|\Sigma|$ = determinant of $\Sigma$ = $\color {darkblue}{det(\Sigma)}$
$$p(x; \mu, \Sigma) = \frac 1{(2 \pi)^{\frac n2} |\Sigma|^{\frac 12}} \exp \left(-\frac 12 (x - \mu)^T \Sigma^{-1} (x - \mu) \right)$$

Multivariate Gaussian examples:

  • For a specific value of $x_1$ and for a specific value of $x_2$, the height of the surface is the value of $p(x)$.
  • $\Sigma$ is a covariance matrix measuring the variance of the features. Shrinking $\Sigma$ yields a narrower distribution, while increasing $\Sigma$ yields a wider, flatter Gaussian.
  • The multivariate distribution can be used to model correlations between the data, by changing the off diagonal entries of $\Sigma$.
  • Vary the mean parameter $\mu$ to change the center of the distribution.
    image5

Anomaly Detection Using the Multivariate Gaussian Distribution

Given training set {$x^{(1)}, x^{(2)}, …, x^{(m)}$} with each $x^{(i)} \in \mathbb R^n$.

  • Fit model $p(x)$ by setting
    $\mu = \frac 1m \sum_{i = 1}^m x^{(i)}$
    $\Sigma = \frac 1m \sum_{i=1}^m (x^{(i)} - \mu) (x^{(i)} - \mu)^T$
  • Given a new example $x$, compute
    $p(x; \mu, \Sigma) = \frac 1{(2 \pi)^{\frac n2} |\Sigma|^{\frac 12}} \exp \left(-\frac 12 (x - \mu)^T \Sigma^{-1} (x - \mu) \right)$
    Flag an anomaly if $p(x) < \epsilon$
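The two steps above can be sketched in NumPy. This is my own illustrative sketch (function names assumed, not from the course):

```python
import numpy as np

def fit_multivariate_gaussian(X):
    """Estimate mean vector and full covariance matrix (1/m normalization)."""
    mu = X.mean(axis=0)
    D = X - mu
    Sigma = (D.T @ D) / X.shape[0]
    return mu, Sigma

def multivariate_p(X, mu, Sigma):
    """Density of N(mu, Sigma) evaluated at each row of X."""
    n = mu.size
    det = np.linalg.det(Sigma)
    inv = np.linalg.inv(Sigma)
    D = X - mu
    expo = -0.5 * np.sum((D @ inv) * D, axis=1)   # -0.5 * (x-mu)^T Sigma^{-1} (x-mu)
    return np.exp(expo) / ((2 * np.pi) ** (n / 2) * np.sqrt(det))

# Sanity check: the density of a standard 2-D Gaussian at its mean is 1/(2*pi)
val = multivariate_p(np.zeros((1, 2)), np.zeros(2), np.eye(2))
```

An example $x$ would then be flagged when `multivariate_p(x, mu, Sigma) < epsilon`, with $\epsilon$ chosen on a cross validation set as before.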

Relationship to the original model

| Original model | Multivariate Gaussian |
| --- | --- |
| $p(x_1; \mu_1, \sigma_1^2) \times \cdots \times p(x_n; \mu_n, \sigma_n^2)$ | $p(x; \mu, \Sigma) = \frac 1{(2 \pi)^{\frac n2} \lvert \Sigma \rvert^{\frac 12}} \exp(-\frac 12 (x - \mu)^T \Sigma^{-1} (x - \mu))$ |
| Manually create features to capture anomalies where $x_1$, $x_2$ take unusual combinations of values, e.g. $x_3 = \frac {x_1}{x_2}$ | Automatically captures correlations between features |
| Computationally cheaper (equivalently, scales better to large $n$) | Computationally more expensive |
| OK even if $m$ (training set size) is small | Must have $m > n$ or else $\Sigma$ is non-invertible (rule of thumb: $m \ge 10n$) |

Original model corresponds to multivariate Gaussian $p(x; \mu, \Sigma)$ where $\Sigma = \begin{bmatrix} \sigma_1^2 & & & \cr & \sigma_2^2 & & \cr & & \ddots & \cr & & & \sigma_n^2 \end{bmatrix}$

Predicting Movie Ratings

Problem Formulation

Example: Predicting movie ratings (user rates movies using 0 to 5 stars)

| Movie | Alice (1) | Bob (2) | Carol (3) | Dave (4) |
| --- | --- | --- | --- | --- |
| Love at last | $\color{blue} 5$ | $\color{blue} 5$ | $\color{blue} 0$ | $\color{blue} 0$ |
| Romance forever | $\color{blue} 5$ | $\color{blue} ?$ | $\color{blue} ?$ | $\color{blue} 0$ |
| Cute puppies of love | $\color{blue} ?$ | $\color{blue} 4$ | $\color{blue} 0$ | $\color{blue} ?$ |
| Nonstop car chases | $\color{blue} 0$ | $\color{blue} 0$ | $\color{blue} 5$ | $\color{blue} 4$ |
| Swords vs. karate | $\color{blue} 0$ | $\color{blue} 0$ | $\color{blue} 5$ | $\color{blue} ?$ |

$n_u$ = number of users, $\ n_m$ = number of movies
$r(i, j)$ = 1 if user $j$ has rated movie $i$
$y^{(i, j)}$ = rating given by user $j$ to movie $i$ (defined only if $r(i, j)$ = 1)

The recommender system problem: given the data $r(i, j)$ and $y^{(i, j)}$, look through the data and predict the missing ratings, in order to recommend new movies that might interest a user.

Content Based Recommendations

Content-based recommender systems
image6

For each user $j$, learn a parameter $\theta^{(j)} \in \mathbb R^{n+1}$. Predict that user $j$ rates movie $i$ with $(\theta ^{(j)}) ^T x^{(i)}$ stars.

This is basically a linear regression problem.
Problem formulation
$r(i, j)$ = 1 if user $j$ has rated movie $i$ (0 otherwise)
$y^{(i, j)}$ = rating by user $j$ on movie $i$ (if defined)
$\theta ^{(j)}$ = parameter vector for user $j$
$x^{(i)}$ = feature vector for movie $i$
$m^{(j)}$ = number of movies rated by user $j$
For user $j$, movie $i$, predict rating: $(\theta ^{(j)}) ^T x^{(i)}$

Optimization objective
To learn $\theta ^{(j)}$ (parameter for user $j$):
$$\min_{\theta ^{(j)}} \frac 12 \sum_{i:r(i, j) = 1} [(\theta ^{(j)}) ^T x^{(i)} - y^{(i, j)}] ^2 + \frac {\lambda}2 \sum_{k=1}^n (\theta_k^{(j)}) ^2$$
To learn $\theta ^{(1)}$, $\theta ^{(2)}$, …, $\theta ^{(n_u)}$:
$$\min_{\theta ^{(1)}, …, \theta ^{(n_u)}} \frac 12 \sum_{j=1}^{n_u} \sum_{i:r(i, j) = 1} [(\theta ^{(j)}) ^T x^{(i)} - y^{(i, j)}] ^2 + \frac {\lambda}2 \sum_{j=1}^{n_u} \sum_{k=1}^n (\theta_k^{(j)}) ^2$$

Gradient descent update:
$$\theta_0^{(j)} = \theta_0^{(j)} - \alpha \sum_{i:r(i, j) = 1} \left ( (\theta ^{(j)}) ^T x^{(i)} - y ^{(i, j)} \right ) x_0^{(i)}$$
$$\theta_k^{(j)} = \theta_k^{(j)} - \alpha \left ( \sum_{i:r(i, j) = 1} \left ( (\theta ^{(j)}) ^T x^{(i)} - y ^{(i, j)} \right ) x_k^{(i)} + \lambda \theta_k^{(j)} \right ) \quad (\text {for} \ k \ne 0)$$
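One such update, for a single user, can be written vectorized in NumPy. This is my own sketch (the function name and arguments are assumptions), with the intercept term $\theta_0$ left unregularized, matching the two update rules above:

```python
import numpy as np

def grad_step_user(theta, X, y, rated, alpha, lam):
    """One gradient-descent update of one user's parameter vector theta.
    X: (n_m, n+1) movie features, column 0 is the intercept feature x_0 = 1.
    y: (n_m,) that user's ratings; rated: boolean mask of movies the user rated."""
    err = X[rated] @ theta - y[rated]   # prediction errors on rated movies only
    grad = X[rated].T @ err             # sum over rated movies of err * x_k
    reg = lam * theta
    reg[0] = 0.0                        # do not regularize theta_0
    return theta - alpha * (grad + reg)
```

For example, with `theta = [0, 0]`, one rated movie `x = [1, 2]` with rating 4, `alpha = 0.1`, and `lam = 0`, the error is $-4$ and the update moves `theta` to `[0.4, 0.8]`.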

Collaborative Filtering

Collaborative Filtering

Feature learning: an algorithm learns what features to use for itself.

In reality, it is very difficult and time-consuming to have someone watch every movie and say how romantic or how action-packed it is. Hence, we do not know how romantic or action-packed each movie is, and the features in the table above are shown as "?".

Then, each user tells us how much they like the different kinds of movies through the value of $\theta ^{(j)}$:
$\theta^{(1)} = \begin{bmatrix} 0 \cr 5 \cr 0 \end{bmatrix}, \ \theta^{(2)} = \begin{bmatrix} 0 \cr 5 \cr 0 \end{bmatrix}, \ \theta^{(3)} = \begin{bmatrix} 0 \cr 0 \cr 5 \end{bmatrix}, \ \theta^{(4)} = \begin{bmatrix} 0 \cr 0 \cr 5 \end{bmatrix}$
The values of $x_1$ and $x_2$ for each movie can be inferred if we get these parameters $\theta^{(j)}$.

Optimization algorithm
Given $\theta^{(1)}, …, \theta^{(n_u)}$, to learn $x^{(i)}$:
$$\min_{x^{(i)}} \frac 12 \sum_{j:r(i, j) = 1} \left( (\theta^{(j)}) ^T x^{(i)} - y^{(i , j)} \right)^2 + \frac {\lambda}2 \sum_{k=1}^n (x_k^{(i)}) ^2$$

Given $\theta^{(1)}, …, \theta^{(n_u)}$, to learn $x^{(1)}, …, x^{(n_m)}$:
$$\min_{x^{(1)}, …, x^{(n_m)}} \frac 12 \sum_{i=1}^{n_m} \sum_{j:r(i, j) = 1} \left( (\theta^{(j)}) ^T x^{(i)} - y^{(i , j)} \right)^2 + \frac {\lambda}2 \sum_{i=1}^{n_m} \sum_{k=1}^n (x_k^{(i)}) ^2$$

Collaborative filtering is a method of making automatic predictions (filtering) about the interests of a user by collecting preferences or taste information from many users (collaborating).

Collaborative Filtering Algorithm

Collaborative filtering optimization objective: minimize over $x^{(1)}, …, x^{(n_m)}$ and $\theta^{(1)}, …, \theta^{(n_u)}$ simultaneously:
$$J(x^{(1)}, …, x^{(n_m)}, \theta^{(1)}, …, \theta^{(n_u)}) = \frac 12 \sum_{(i, j):r(i, j) = 1} \left( (\theta^{(j)}) ^T x^{(i)} - y^{(i , j)} \right)^2 + \frac {\lambda}2 \sum_{i=1}^{n_m} \sum_{k=1}^n (x_k^{(i)}) ^2 + \frac {\lambda}2 \sum_{j=1}^{n_u} \sum_{k=1}^n (\theta_k^{(j)}) ^2$$
$$\min_{x^{(1)}, …, x^{(n_m)} \\ \theta^{(1)}, …, \theta^{(n_u)}} J(x^{(1)}, …, x^{(n_m)}, \theta^{(1)}, …, \theta^{(n_u)})$$

With the new algorithm, we learn features $x \in \mathbb R^n$ and parameters $\theta \in \mathbb R^n$: since all the features are now learned, there is no need to hard-code the feature that is always equal to 1. If the algorithm really wants such a feature, it can learn one for itself by setting $x_1 = 1$ (there is no $x_0 = 1$ convention here).

Collaborative filtering algorithm

  • Initialize $x^{(1)}, …, x^{(n_m)}, \theta^{(1)}, …, \theta^{(n_u)}$ to small random values.
  • Minimize $J(x^{(1)}, …, x^{(n_m)}, \theta^{(1)}, …, \theta^{(n_u)})$ using gradient descent (or an advanced optimization algorithm). E.g. for every $j = 1, …, n_u$ and $i = 1, …, n_m$:
    $x_k^{(i)} = x_k^{(i)} - \alpha \left( \sum_{j:r(i, j) = 1} \left((\theta^{(j)}) ^T x^{(i)} - y^{(i, j)} \right) \theta_k^{(j)} + \lambda x_k^{(i)} \right)$
    $\theta_k^{(j)} = \theta_k^{(j)} - \alpha \left( \sum_{i:r(i, j) = 1} \left((\theta^{(j)}) ^T x^{(i)} - y^{(i, j)} \right) x_k^{(i)} + \lambda \theta_k^{(j)} \right)$
  • For a user with parameter $\theta$ and a movie with (learned) features $x$, predict a star rating of $\theta^T x$.
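The three steps above can be sketched with vectorized NumPy. This is my own sketch, not the course's code; the mask matrix `R` encodes $r(i, j)$, and the data is the small ratings table from these notes with "?" treated as unobserved:

```python
import numpy as np

def cofi_cost_and_grads(X, Theta, Y, R, lam):
    """Collaborative filtering cost J and its gradients w.r.t. X and Theta.
    X: (n_m, n) movie features; Theta: (n_u, n) user parameters;
    Y: (n_m, n_u) ratings; R: (n_m, n_u) 0/1 mask of observed ratings."""
    E = (X @ Theta.T - Y) * R            # errors, zeroed where no rating exists
    J = 0.5 * np.sum(E ** 2) + lam / 2 * (np.sum(X ** 2) + np.sum(Theta ** 2))
    X_grad = E @ Theta + lam * X         # dJ/dX
    Theta_grad = E.T @ X + lam * Theta   # dJ/dTheta
    return J, X_grad, Theta_grad

# Tiny run on the notes' ratings table ("?" -> 0 in Y, masked out in R)
Y = np.array([[5, 5, 0, 0], [5, 0, 0, 0], [0, 4, 0, 0],
              [0, 0, 5, 4], [0, 0, 5, 0]], dtype=float)
R = np.array([[1, 1, 1, 1], [1, 0, 0, 1], [0, 1, 1, 0],
              [1, 1, 1, 1], [1, 1, 1, 0]], dtype=float)
rng = np.random.default_rng(0)
X = 0.1 * rng.standard_normal((5, 2))        # n = 2 latent features
Theta = 0.1 * rng.standard_normal((4, 2))
for _ in range(3000):                        # simultaneous gradient descent
    J, gX, gT = cofi_cost_and_grads(X, Theta, Y, R, lam=0.0)
    X -= 0.01 * gX
    Theta -= 0.01 * gT
```

After training, `X @ Theta.T` reproduces the observed ratings closely and fills in predictions for the "?" entries.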

Low Rank Matrix Factorization

Vectorization: Low Rank Matrix Factorization

Group all the ratings into a matrix:
$Y = \begin{bmatrix} 5 & 5 & 0 & 0 \cr 5 & ? & ? & 0 \cr ? & 4 & 0 & ? \cr 0 & 0 & 5 & 4 \cr 0 & 0 & 5 & ? \end{bmatrix}$

Given this matrix $Y$ of all the ratings, there is an alternative way of writing out all the predicted ratings of the algorithm: the prediction for user $j$ on movie $i$ is $(\theta^{(j)}) ^T x^{(i)}$.
Let $X = \begin{bmatrix} —(x^{(1)}) ^T— \cr —(x^{(2)}) ^T— \cr \vdots \cr —(x^{(n_m)}) ^T— \end{bmatrix}$,   $\Theta = \begin{bmatrix} —(\theta^{(1)}) ^T— \cr —(\theta^{(2)}) ^T— \cr \vdots \cr —(\theta^{(n_u)}) ^T— \end{bmatrix}$
Predict ratings by computing the following matrix (low rank matrix factorization):
$$
\begin{bmatrix}
(\theta^{(1)}) ^T (x^{(1)}) & (\theta^{(2)}) ^T (x^{(1)}) & \cdots & (\theta^{(n_u)}) ^T (x^{(1)}) \cr
(\theta^{(1)}) ^T (x^{(2)}) & (\theta^{(2)}) ^T (x^{(2)}) & \cdots & (\theta^{(n_u)}) ^T (x^{(2)}) \cr
\vdots & \vdots & \ddots & \vdots \cr
(\theta^{(1)}) ^T (x^{(n_m)}) & (\theta^{(2)}) ^T (x^{(n_m)}) & \cdots & (\theta^{(n_u)}) ^T (x^{(n_m)})
\end{bmatrix} = X \Theta^T
$$
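The vectorization is a one-liner in NumPy: with the rows of `X` and `Theta` laid out as above, one matrix product computes every prediction at once (my own illustrative snippet, with random placeholder data):

```python
import numpy as np

rng = np.random.default_rng(1)
n_m, n_u, n = 5, 4, 3
X = rng.standard_normal((n_m, n))       # row i is (x^(i))^T
Theta = rng.standard_normal((n_u, n))   # row j is (theta^(j))^T

# P[i, j] = (theta^(j))^T x^(i): the full (n_m, n_u) matrix of predicted ratings
P = X @ Theta.T
```

Entry `P[i, j]` equals the single prediction `Theta[j] @ X[i]`, so the low rank factorization replaces $n_m \times n_u$ inner products with one matrix multiply.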

Using the learned features to find related movies:
For each product $i$, we learn a feature vector $x^{(i)} \in \mathbb R^n$. To find movies related to movie $i$, look for the movies $j$ with the smallest distance $\lVert x^{(i)} - x^{(j)} \rVert$.
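Related movies can be found by ranking the other movies by the Euclidean distance between their learned feature vectors. A minimal sketch (my own code; `most_related` is an assumed name):

```python
import numpy as np

def most_related(X, i, k=5):
    """Indices of the k movies whose learned feature vectors are closest to movie i."""
    d = np.linalg.norm(X - X[i], axis=1)   # ||x^(i) - x^(j)|| for every j
    d[i] = np.inf                          # exclude the movie itself
    return np.argsort(d)[:k]
```

For instance, with feature vectors `[[0, 0], [0.1, 0], [5, 5], [0.2, 0.1]]`, the two movies most related to movie 0 are movies 1 and 3.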

Implementational Detail: Mean Normalization

In addition to the previous 4 users, add a new user, Eve, who has not rated any movies.
Therefore, the ratings matrix $Y = \begin{bmatrix} 5 & 5 & 0 & 0 & ? \cr 5 & ? & ? & 0 & ? \cr ? & 4 & 0 & ? & ? \cr 0 & 0 & 5 & 4 & ? \cr 0 & 0 & 5 & ? & ? \end{bmatrix}$

Let $n = 2$, so we learn a parameter vector $\theta^{(5)} \in \mathbb R^2$. Since Eve has not rated any movies, there is no $i$ with $r(i, 5) = 1$, so the first term of the objective, $\sum_{(i, j): r(i, j) = 1} \left( (\theta^{(j)}) ^T x^{(i)} - y^{(i, j)} \right)^2$, plays no role in determining $\theta^{(5)}$. The regularization term over the movie features does not involve $\theta^{(5)}$ either.

Hence, only the last term affects $\theta^{(5)}$:
$\frac {\lambda}2 \sum_{k=1}^n (\theta_k^{(5)}) ^2 = \frac {\lambda}2 [(\theta_1^{(5)}) ^2 + (\theta_2^{(5)}) ^2]$

However, minimizing this term alone drives $\theta^{(5)}$ to the all-zeros vector, i.e. $\theta^{(5)} = \begin{bmatrix} 0 \cr 0 \end{bmatrix}$. Apparently, it is not useful to predict that Eve rates every movie 0 stars.

Mean Normalization:
Compute the average rating of each movie:
$\mu = \begin{bmatrix} (5+5+0+0) / 4 \cr (5+0) / 2 \cr (4+0) / 2 \cr (0+0+5+4) / 4 \cr (0+0+5+0) / 4 \end{bmatrix} = \begin{bmatrix} 2.5 \cr 2.5 \cr 2 \cr 2.25 \cr 1.25 \end{bmatrix} $

Normalize each movie to have an average rating of 0:
$\bar{Y} = Y - \mu = \begin{bmatrix} 2.5 & 2.5 & -2.5 & -2.5 & ? \cr 2.5 & ? & ? & -2.5 & ? \cr ? & 2 & -2 & ? & ? \cr -2.25 & -2.25 & 2.75 & 1.75 & ? \cr -1.25 & -1.25 & 3.75 & ? & ? \end{bmatrix}$ (the mean is subtracted only where a rating exists; "?" entries stay unknown)

Use the collaborative filtering algorithm to learn parameters $\theta^{(j)}$ and features $x^{(i)}$ from the mean-normalized movie ratings above.

For user $j$ on movie $i$, predict:
$\qquad (\theta^{(j)}) ^T (x^{(i)}) + \mu_i$

User 5 (Eve):
$\because \theta^{(5)} = \begin{bmatrix} 0 \cr 0 \end{bmatrix}$
$\therefore (\theta^{(5)}) ^T (x^{(i)}) = 0$, so the prediction for Eve on movie $i$ is simply $\mu_i$, the movie's average rating.

To sum up, a movie with no ratings at all probably should not be recommended to anyone anyway, so the case of a user who has not rated any movies matters more than the case of a movie that has not received a single rating; that is why we mean-normalize the rows (movies) rather than the columns (users).